# Pckgs -------------------------------------
library(fs) # Cross-Platform File System Operations Based on 'libuv'
library(tidyverse) # Easily Install and Load the 'Tidyverse'
library(janitor) # Simple Tools for Examining and Cleaning Dirty Data
library(skimr) # Compact and Flexible Summaries of Data
library(here) # A Simpler Way to Find Your Files
library(paint) # paint data.frames summaries in colour
library(readxl) # Read Excel Files
library(tidytext) # Text Mining using 'dplyr', 'ggplot2', and Other Tidy Tools
library(SnowballC) # Snowball Stemmers Based on the C 'libstemmer' UTF-8 Library
library(rsample) # General Resampling Infrastructure
library(rvest) # Easily Harvest (Scrape) Web Pages
library(cleanNLP) # A Tidy Data Model for Natural Language Processing
library(kableExtra) # Construct Complex Table with 'kable' and Pipe Syntax
WB Project PDO text analysis
Work in progress
Set up
cleanNLP package
Depending on the installed version, cleanNLP supports multiple backends for processing text, such as CoreNLP, spaCy, udpipe, and stanza. Each of these backends has different capabilities and might require different initialization procedures.
- CoreNLP ~ powerful Java-based NLP toolkit developed by Stanford, which includes many linguistic tools like tokenization, part-of-speech tagging, and named entity recognition.
  - ❕❗️ NEEDS EXTERNAL INSTALLATION (must be installed in Java with cnlp_install_corenlp(), which installs the Java JAR files and models)
- spaCy ~ fast and modern NLP library written in Python. It provides advanced features like dependency parsing, named entity recognition, and tokenization.
  - ❕❗️ NEEDS EXTERNAL INSTALLATION (must be installed in Python with spacy_install(), which installs both spaCy and the necessary Python dependencies; the spacyr R package must also be installed to interface with it)
- udpipe ~ R package that provides bindings to the UDPipe NLP toolkit. Fast, lightweight and language-agnostic NLP library for tokenization, part-of-speech tagging, lemmatization, and dependency parsing.
- stanza ~ another modern NLP library from Stanford, similar to CoreNLP but built on PyTorch and supports over 66 languages…
When you initialize a backend (like CoreNLP) in cleanNLP, it stays active for the entire session unless you reinitialize or explicitly change it.
# ---- 1) Initialize the CoreNLP backend
library(cleanNLP)
cnlp_init_corenlp()
# If you want to specify a language or model path:
cnlp_init_corenlp(
  language = "en"
  # , model_path = "/path/to/corenlp-models"
)
# ---- 2) Initialize the spaCy backend
library(cleanNLP)
library(spacyr)
# Initialize spaCy in cleanNLP
cnlp_init_spacy()
# Optional: specify language model
cnlp_init_spacy(model_name = "en_core_web_sm")
# ---- 3) Initialize the udpipe backend
library(cleanNLP)
# Initialize udpipe backend
cnlp_init_udpipe(model_name = "english")
# ---- 4) Initialize the stanza backend
—————————————————————————
Data sources
WB Projects & Operations [CHECK 🔴]
World Bank Projects & Operations can be explored at:
- Data Catalog. From which
Accessibility Classification: public under Creative Commons Attribution 4.0
For example: https://datacatalog.worldbank.org/search/dataset/0037800 https://datacatalog.worldbank.org/search/dataset/0037800/World-Bank-Projects---Operations
—————————————————————————
Load pre-processed Projs’ PDO dataset pdo_train_t
Syntactic annotation is a computationally expensive operation, so I don’t want to repeat it every time I restart the session.
[Saved file projs_train_t ]
Done in analysis/_01a_WB_project_pdo_prep.qmd:
- I manually retrieved ALL WB projects approved between FY 1947 and 2026 as of 31/08/2024, simply using the Excel button on this page: WBG Projects
  - By the way, this is the link “list-download-excel”
  - then saved the HUUUGE .xls files in data/raw_data/project2/all_projects_as_of29ago2024.xls
    - (plus a Rdata copy of the original file)
- Split the dataset and keep only projs_train (50% of projects with PDO text, i.e. 4413 PDOs)
- Clean the dataset and save projs_train_t (cleaned train dataset)
- Obtain PoS tagging + tokenization with the cleanNLP package (functions cnlp_init_udpipe() + cnlp_annotate()) and save projs_train_t (cleaned train dataset).
Important mod
# Ensure the token ID (tid) is numeric, so that tokens sort in reading order
pdo_train_t <- pdo_train_t %>%
  mutate(tid = as.numeric(tid)) # Convert tid from character to numeric
Explain Tokenization and PoS Tagging
i) Tokenization
Breaking units of language into components relevant for the research question is called “tokenization”. Components can be words, n-grams, sentences, etc.; tokenization can also mean combining smaller units into larger ones.
- Tokenization is a row-wise operation: it changes the number of rows in the dataset.
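To make the "row-wise" point concrete, here is a minimal sketch on made-up data (the `toy_pdo` table and its two PDO sentences are invented for illustration; they are not from the WB dataset). It uses `tidytext::unnest_tokens()` rather than the cleanNLP annotation used for the real data, just because it needs no model download.

```r
library(tibble)
library(tidytext)

# Hypothetical toy data: one row per project, one PDO text each
toy_pdo <- tibble(
  proj_id = c("P001", "P002"),
  pdo = c(
    "Improve access to clean water.",
    "Strengthen public sector capacity in health."
  )
)

# One row per word after tokenization (by default words are
# lower-cased and punctuation is dropped)
toy_tokens <- unnest_tokens(toy_pdo, output = word, input = pdo)

nrow(toy_pdo)    # rows before tokenization: one per document
nrow(toy_tokens) # rows after tokenization: one per token
```

The other columns (`proj_id` here) are repeated on every token row, which is why the annotated dataset can keep the project metadata next to each token.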
The choices of tokenization
- Should words be lower cased?
- Should punctuation be removed?
- Should numbers be replaced by some placeholder?
- Should words be stemmed or lemmatized (i.e. reduced to a root or dictionary form)? ☑️
- Should bigrams/multi-word phrase be used instead of single word phrases?
- Should stopwords (the most common words) be removed? ☑️
- Should rare words be removed?
- Should hyphenated words be split into two words? ❌
For the moment I keep everything, staying as conservative as possible.
ii) PoS Tagging
Linguistic annotation is a common form of enriching text data, i.e. adding information about the text that is not directly present in the text itself.
Building on this annotation (e.g. classifying tokens as noun, verb, adjective, etc.), one can discover intent or action in a sentence, or scan for “verb-noun” patterns.
Here I have a training dataset file with:
| Variable | Type | Provenance | Description |
|---|---|---|---|
| proj_id | chr | original PDO data | |
| pdo | chr | original PDO data | |
| word_original | chr | original PDO data | |
| sid | int | output cleanNLP | sentence ID |
| tid | chr | output cleanNLP | token ID within sentence |
| token | chr | output cleanNLP | Tokenized form of the token. |
| token_with_ws | chr | output cleanNLP | Token with trailing whitespace |
| lemma | chr | output cleanNLP | The base form of the token |
| upos | chr | output cleanNLP | Universal part-of-speech tag (e.g., NOUN, VERB, ADJ). |
| xpos | chr | output cleanNLP | Language-specific part-of-speech tags. |
| feats | chr | output cleanNLP | Morphological features of the token |
| tid_source | chr | output cleanNLP | Token ID in the source document |
| relation | chr | output cleanNLP | Dependency relation between the token and its head token |
| pr_name | chr | output cleanNLP | Name of the parent token |
| FY_appr | dbl | original PDO data | |
| FY_clos | dbl | original PDO data | |
| status | chr | original PDO data | |
| regionname | chr | original PDO data | |
| countryname | chr | original PDO data | |
| sector1 | chr | original PDO data | |
| theme1 | chr | original PDO data | |
| lendinginstr | chr | original PDO data | |
| env_cat | chr | original PDO data | |
| ESrisk | chr | original PDO data | |
| curr_total_commitment | dbl | original PDO data | |
— PoS Tagging: upos (Universal Part-of-Speech)
| upos | n | percent | explan |
|---|---|---|---|
| ADJ | 21852 | 0.0853714 | Adjective |
| ADP | 27848 | 0.1087965 | Adposition |
| ADV | 3010 | 0.0117595 | Adverb |
| AUX | 3738 | 0.0146036 | Auxiliary |
| CCONJ | 14486 | 0.0565939 | Coordinating conjunction |
| DET | 22121 | 0.0864223 | Determiner |
| INTJ | 81 | 0.0003165 | Interjection |
| NOUN | 72668 | 0.2838993 | Noun |
| NUM | 2285 | 0.0089270 | Numeral |
| PART | 8846 | 0.0345595 | Particle |
| PRON | 2351 | 0.0091849 | Pronoun |
| PROPN | 14860 | 0.0580550 | Proper noun |
| PUNCT | 29442 | 0.1150240 | Punctuation |
| SCONJ | 2219 | 0.0086692 | Subordinating conjunction |
| SYM | 348 | 0.0013596 | Symbol |
| VERB | 26397 | 0.1031278 | Verb |
| X | 3412 | 0.0133300 | Other |
On random visual check, these are not always correct, but they are a good starting point for now.
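A frequency table like the one above can be obtained with a `count()` plus a share column. This is only a sketch on an invented 7-token sentence (`toy_annotated` is hypothetical); on the real data the same two steps would be applied to `pdo_train_t`.

```r
library(dplyr)
library(tibble)

# Hypothetical annotated tokens (upos tags assigned by hand for illustration)
toy_annotated <- tibble(
  token = c("improve", "the", "quality", "of", "public", "services", "."),
  upos  = c("VERB", "DET", "NOUN", "ADP", "ADJ", "NOUN", "PUNCT")
)

# Count tokens per PoS tag and add each tag's share of all tokens
toy_upos_freq <- toy_annotated %>%
  count(upos, sort = TRUE) %>%
  mutate(percent = n / sum(n))

toy_upos_freq
```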
iii) Make lower case
iv) Stemming
I use SnowballC::wordStem to stem the words (examples in the table below).
Why stemming? For example, in topic modeling, stemming reduces noise by making it easier for the model to identify core topics without being distracted by grammatical variations. (Lemmatization is more computationally intensive, as it requires linguistic context and dictionaries, making it slower, especially on large datasets.)
| Token | Lemma | Stem |
|---|---|---|
| development | development | develop |
| quality | quality | qualiti |
| high-quality | high-quality | high-qual |
| include | include | includ |
| logistics | logistic | logist |
| government/governance | Government/government/governance | govern |
NOTE: Among the words/stems encountered in PDOs, there are a lot of acronyms which may refer to World Bank lingo, or local agencies, etc… Especially when looked at in lower case form they don’t make much sense…
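As a quick illustration of the stemming step, this sketch runs `SnowballC::wordStem()` on a few of the tokens from the table (the `toy_words` vector is just a hand-picked sample, and I leave the stemmer language at its default):

```r
library(SnowballC)

# A handful of tokens taken from the table above
toy_words <- c("development", "quality", "include", "government", "governance")

# wordStem() is vectorised: one stem per input word
toy_stems <- wordStem(toy_words)

# Side-by-side view of token and stem
data.frame(token = toy_words, stem = toy_stems)
```

Note how "government" and "governance" collapse onto the same stem, which is exactly the kind of merging that reduces the vocabulary but can blur meaning.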
v) Document-term matrix or TF-IDF
The tf-idf is the product of the term frequency and the inverse document frequency:
\[ \begin{aligned} tf(\text{term}) &= \frac{n_{\text{term}}}{n_{\text{terms in document}}} \\ idf(\text{term}) &= \ln{\left(\frac{n_{\text{documents}}}{n_{\text{documents containing term}}}\right)} \\ tf\text{-}idf(\text{term}) &= tf(\text{term}) \times idf(\text{term}) \end{aligned} \]
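To check my reading of the formulas, here is a by-hand computation on a tiny invented count table (`toy_counts`, with hypothetical documents `d1`/`d2`); it follows the three definitions above term by term and uses the natural log for idf.

```r
library(dplyr)
library(tibble)

# Hypothetical term counts: n = occurrences of `term` in `doc`
toy_counts <- tribble(
  ~doc, ~term,    ~n,
  "d1", "water",  2,
  "d1", "health", 1,
  "d2", "health", 3,
  "d2", "road",   1
)

n_docs <- n_distinct(toy_counts$doc)

toy_tf_idf <- toy_counts %>%
  group_by(doc) %>%
  mutate(tf = n / sum(n)) %>%                    # share of the document's terms
  group_by(term) %>%
  mutate(idf = log(n_docs / n_distinct(doc))) %>% # ln(N docs / N docs with term)
  ungroup() %>%
  mutate(tf_idf = tf * idf)

toy_tf_idf
```

A term that appears in every document ("health" here) gets idf = ln(1) = 0, so its tf-idf is 0 no matter how frequent it is.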
— My own custom_stop_words
Remove stop words, which are the most common words in a language.
- but I don’t want to remove any meaningful word for now
# Custom list of articles, prepositions, and pronouns
custom_stop_words <- c(
# Articles
"the", "a", "an",
"and", "but", "or", "yet", "so", "for", "nor", "as", "at", "by", "per",
# Prepositions
"of", "in", "on", "at", "by", "with", "about", "against", "between", "into", "through",
"during", "before", "after", "above", "below", "to", "from", "up", "down", "under",
"over", "again", "further", "then", "once",
# Pronouns
"i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your",
"yours", "yourself", "yourselves", "he", "him", "his", "himself", "she", "her",
"hers", "herself", "it", "its", "itself", "they", "them", "their", "theirs", "themselves" ,
"this", "that", "these", "those", "which", "who", "whom", "whose", "what", "where",
"when", "why", "how", "all", "any", "both", "each", "few", "more", "most", "other",
# "some", "such", "no", "not",
# "too", "very",
# verbs
"is", "are", "would", "could", "will", "be"
)
# Convert to a data frame if needed for consistency with tidytext
custom_stop_words_df <- tibble(word = custom_stop_words)
— TF-IDF matrix on train pdo
# reduce size
pdo_train_4_tf_idf <- pdo_train_t %>% # 255964
# Keep only content words [very restrictive for now]
# normally c("NOUN", "VERB", "ADJ", "ADV")
filter(upos %in% c("NOUN")) %>% # 72,668
filter(!token_l %in% c("development", "objective", "project")) %>% # 66,741
# get rid of stop words (from my custom list)
filter(!token_l %in% custom_stop_words_df$word) %>% # 66,704
# Optional: remove lemmas of length 1
filter(nchar(lemma) > 1) # 66,350
Now, count the occurrences of each lemma in each document. (These counts are the basis of the term frequency, or tf.)
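The counting step is not shown here, so this is only a sketch of what it could look like, on an invented token table (`toy_tokens`); the name `toy_lemma_counts` is mine, and I am assuming that the real `lemma_counts` object has the same shape (one row per project and lemma, with a count column `n`).

```r
library(dplyr)
library(tibble)

# Hypothetical filtered tokens: one row per token occurrence
toy_tokens <- tibble(
  proj_id = c("P1", "P1", "P1", "P2", "P2"),
  lemma   = c("water", "water", "health", "health", "road")
)

# One row per (document, lemma) pair, with the number of occurrences in n
toy_lemma_counts <- toy_tokens %>%
  count(proj_id, lemma, sort = TRUE)

toy_lemma_counts
```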
With the lemma counts prepared, the bind_tf_idf() function from the tidytext package computes the TF-IDF scores.
# Compute the TF-IDF scores
lemma_tf_idf <- lemma_counts %>%
bind_tf_idf(lemma, proj_id, n) %>%
  arrange(desc(tf_idf))
What to use: token, lemma, or stem?
General Preference in Real-World NLP:
- Tokens for analyses where word forms matter or for sentiment analysis.
- Lemmas (*) for most general-purpose NLP tasks where you want to reduce dimensionality while maintaining accuracy and clarity of meaning.
- Stems for very large datasets, search engines, and applications where speed and simplicity are more important than linguistic precision.
(*) I use lemma, after “aggressively” reducing the number of words to consider, and removing stop words (at least for now).
_______
TEXT ANALYSIS/SUMMARY
_______
Frequencies of documents/words/stems
We are looking at the training data subset pdo_train_t, which has 255964 rows and 26 columns, obtained from 4071 PDOs of 4413 World Bank projects approved in Fiscal Years ranging from 2001 to 2023.
| entity | counts |
|---|---|
| N proj | 4413 |
| N PDOs | 4071 |
| N words | 13231 |
| N token | 11399 |
| N lemma | 11474 |
| N stem | 8812 |
[FUNC] save plots
Term frequency
Note: normally, the most frequent words are function words (e.g. determiners, prepositions, pronouns, and auxiliary verbs), which are not very informative. Moreover, even content words (e.g. nouns, verbs, adjectives, and adverbs) can often be quite generic semantically speaking (e.g. “good” may be used for many different things).
In this analysis, I do not use the STOPWORD approach, but use the PoS tags to reduce the dataset to just the content words, that is nouns, verbs, adjectives, and adverbs.
[FIG] Overall token freq ggplot
- Excluding “pdo”, “project”, “development”, “objective”, “objectives”
- Including only “content words” (NOUN, VERB, ADJ, ADV)
# Evaluate the title with glue first
title_text <- glue::glue("Most frequent token in {n_distinct(pdo_train_t$proj_id)} PDOs from projects approved between FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}")
proj_wrd_freq <- pdo_train_t %>% # 123,927
# include only content words
filter(upos %in% c("NOUN", "VERB", "ADJ", "ADV")) %>%
#filter (!(upos %in% c("AUX","CCONJ", "INTJ", "DET", "PART","ADP", "SCONJ", "SYM", "PART", "PUNCT"))) %>%
filter (!(relation %in% c("nummod" ))) %>% # 173,686
filter (!(token_l %in% c("pdo","project", "development", "objective","objectives", "i", "ii", "iii",
"is"))) %>% # "is" when it is tagged as VERB
count(token_l) %>%
filter(n > 800) %>%
mutate(token_l = reorder(token_l, n)) %>% # reorder values by frequency
# plot
ggplot(aes(token_l, n)) +
geom_col(fill = "gray") +
coord_flip() + # flip x and y coordinates so we can read the words better
labs(title = title_text,
subtitle = "[token_l count > 800]", y = "", x = "")+
theme(plot.title.position = "plot")
proj_wrd_freq
[FIG] Overall stem freq ggplot
- Without “project” “develop”,“objective”
- Including only “content words” (NOUN, VERB, ADJ, ADV)
# Evaluate the title with glue first
title_text <- glue::glue("Most frequent STEM in {n_distinct(pdo_train_t$proj_id)} PDOs from projects approved between FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}")
# Plot
proj_stem_freq <- pdo_train_t %>% # 256,632
# include only content words
filter(upos %in% c("NOUN", "VERB", "ADJ", "ADV")) %>%
filter (!(relation %in% c("nummod" ))) %>% # 173,686
filter (!(stem %in% c("pdo","project", "develop", "object", "i", "ii", "iii"))) %>%
count(stem) %>%
filter(n > 800) %>%
mutate(stem = reorder(stem, n)) %>% # reorder values by frequency
# plot
ggplot(aes(stem, n)) +
geom_col(fill = "gray") +
coord_flip() + # flip x and y coordinates so we can read the words better
labs(title = title_text,
subtitle = "[stem count > 800]", y = "", x = "") +
theme(plot.title.position = "plot")
proj_stem_freq
Evidently, after stemming, more words (or stems) reach the threshold frequency count of 800.
_______
Create bigrams
Here I use [cnlp_annotate() output +] dplyr to combine consecutive tokens into bigrams.
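Before running it on the full dataset, the idea can be sketched on a single invented sentence (`toy_annotated` is hypothetical, with the same `proj_id` / `sid` / `tid` / `token` columns as the cleanNLP output): within each sentence, each token is pasted to the one that follows it.

```r
library(dplyr)
library(tibble)

# Hypothetical annotated sentence: one row per token, tid = position in sentence
toy_annotated <- tibble(
  proj_id = "P001",
  sid     = 1L,
  tid     = 1:5,
  token   = c("Improve", "public", "sector", "service", "delivery")
)

toy_bigrams <- toy_annotated %>%
  group_by(proj_id, sid) %>%             # never pair tokens across sentences
  arrange(tid, .by_group = TRUE) %>%     # tokens in reading order
  mutate(next_token = lead(token)) %>%   # the following token (NA for the last one)
  filter(!is.na(next_token)) %>%         # the last token has no partner
  mutate(bigram = tolower(paste(token, next_token))) %>%
  ungroup()

toy_bigrams$bigram
```

A sentence with k tokens yields k - 1 bigrams, so the bigram table is slightly shorter than the token table.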
# Create bigrams by pairing consecutive tokens by sentence ID and token IDs
bigrams <- pdo_train_t %>%
# keeping FY with tokens
group_by(FY_appr, proj_id, pdo, sid ) %>%
arrange(tid) %>%
# Using mutate() and lead(), we create bigrams from consecutive tokens
mutate(next_token = lead(token),
bigram = paste(token, next_token)) %>%
# make bigram low case
mutate(bigram = tolower(bigram)) %>%
# only includes the rows where valid bigrams are formed
filter(!is.na(next_token)) %>%
ungroup() %>%
arrange(FY_appr, proj_id, sid, tid) %>%
  select(FY_appr, proj_id, pdo, sid, tid, token, bigram)
Clean bigrams
# Separate the bigram column into two words
bigrams_cleaned <- bigrams %>%
tidyr::separate(bigram, into = c("word1", "word2"), sep = " ")
# Remove stopwords and bigrams containing punctuation
bigrams_cleaned <- bigrams_cleaned %>%
# custom stop words
filter(!word1 %in% custom_stop_words_df$word, !word2 %in% custom_stop_words_df$word) %>%
# Remove punctuation
filter(!str_detect(word1, "[[:punct:]]"), !str_detect(word2, "[[:punct:]]"))
# Reunite the cleaned words into the bigram column
bigrams_cleaned <- bigrams_cleaned %>%
unite(bigram, word1, word2, sep = " ") %>%
# Remove too obvious bigrams
filter(!bigram %in% c("development objective", "development objectives",
"proposed project", "project development"))
# View the cleaned dataframe
bigrams_cleaned
# Count the frequency of each bigram
bigram_freq <- bigrams_cleaned %>%
  count(bigram, sort = TRUE)
[FIG] most frequent bigrams in PDOs
- Excluding “development objective”, “development objectives”, “proposed project”, “project development”
- Excluding stopwords and bigrams containing punctuation
# ---- Prepare data for plotting
# Evaluate the title with glue first
title_text <- glue::glue("Frequency of bigrams in PDOs over FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}")
# Define the bigrams you want to highlight
bigrams_to_highlight <- c("public sector", "private sector")
# Plot the most frequent bigrams
pdo_bigr_freq <- bigram_freq %>%
slice_max(n, n = 25) %>%
ggplot(aes(x = reorder(bigram, n), y = n,
fill = ifelse(bigram %in% bigrams_to_highlight, bigram, "Other"))) +
geom_col() +
scale_fill_manual(values = c("public sector" = "#005ca1", "private sector" = "#e60066", "Other" = "grey")) +
guides(fill = "none") +
coord_flip() +
labs(title = title_text,x = "Bigram", y = "Frequency",
subtitle = "ranking first 25" ) +
theme(plot.title.position = "plot")
pdo_bigr_freq
I wasn’t expecting “eligible crisis”?! I was expecting education to appear!?!?
[FIG] Changes over time ? [CMPL 🟠]
_______
Explore bigrams
_______
>>>>>> HERE <<<<<<<<<<<<<<<<<<
Main refs:
- https://www.nlpdemystified.org/course/advanced-preprocessing
- review what I had done to clean the text in analysis//03_WDR_pdotracs_explor.qmd
- https://cengel.github.io/R-text-analysis/textprep.html#detecting-patterns
- https://guides.library.upenn.edu/penntdm/r
- https://smltar.com/stemming#how-to-stem-text-in-r (BOOK, STEMMING)
_______
Isolate other BIGRAM frequency…
[FIG] Most frequent bigrams
# Evaluate the title with glue first
title_text <- glue::glue("Most frequent token in {n_distinct(pdo_train_t$proj_id)} PDOs from projects approved between FY {min(pdo_train_t$FY_appr)}-{max(pdo_train_t$FY_appr)}")
pdo_bigr_freq <- pdo_train_t %>% # 123,927
# include only content words
filter(upos %in% c("NOUN", "VERB", "ADJ", "ADV")) %>%
#filter (!(upos %in% c("AUX","CCONJ", "INTJ", "DET", "PART","ADP", "SCONJ", "SYM", "PART", "PUNCT"))) %>%
filter (!(relation %in% c("nummod" ))) %>% # 173,686
filter (!(token_l %in% c("pdo","project", "development", "objective","objectives", "i", "ii", "iii",
"is"))) %>% # "is" when it is tagged as VERB
count(token_l) %>%
filter(n > 800) %>%
mutate(token_l = reorder(token_l, n)) %>% # reorder values by frequency
# plot
ggplot(aes(token_l, n)) +
geom_col(fill = "gray") +
coord_flip() + # flip x and y coordinates so we can read the words better
labs(title = title_text,
subtitle = "[token_l count > 800]", y = "", x = "")+
theme(plot.title.position = "plot")
pdo_bigr_freq
… [FIG] Notable bigrams (climate change)!
Word and document frequency: Tf-idf
The goal is to quantify what a document is about.
- term frequency (tf) = how frequently a word occurs in a document… but there are words that occur many times and are not important
- term’s inverse document frequency (idf) = decreases the weight for commonly used words and increases the weight for words that are not used very much in a collection of documents.
- tf-idf statistic (= tf × idf) = the frequency of a term adjusted for how rarely it is used; an alternative to removing stopwords. [It measures how important a word is to a document in a collection (or corpus) of documents, but it is still a rule-of-thumb or heuristic quantity]
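One way to turn the three definitions above into "what is this document about?" is to keep, for each document, the term with the highest tf-idf. A minimal sketch on invented counts (`toy_counts` is hypothetical), using `tidytext::bind_tf_idf()`:

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical lemma counts per project
toy_counts <- tribble(
  ~proj_id, ~lemma,    ~n,
  "P1",     "water",   3,
  "P1",     "service", 2,
  "P2",     "road",    4,
  "P2",     "service", 1
)

# Add tf, idf and tf_idf columns, then keep the top-scoring lemma per project
toy_top_terms <- toy_counts %>%
  bind_tf_idf(term = lemma, document = proj_id, n = n) %>%
  group_by(proj_id) %>%
  slice_max(tf_idf, n = 1) %>%
  ungroup()

toy_top_terms
```

"service" occurs in both toy documents, so its idf (and tf-idf) is 0 and it never comes out as the distinctive term, even though it is fairly frequent.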
N-Grams
…
Co-occurrence
…
_______
TOPIC MODELING w ML
_______
Compare PDO text v. project METADATA [CMPL 🟠]
NLP models trained on document metadata and structure can be combined with text analysis to improve classification accuracy.
STEPS
- Use document text (abstracts) as features to train a supervised machine learning model. The labeled data (documents with sector tags) will serve as training data, and the model can predict the missing sector tags for unlabeled documents.
- TEXT preprocessing (e.g. tokenization, lemmatization, stopword removal, TF-IDF)
  - Convert the processed text into a numerical format using Term Frequency-Inverse Document Frequency (TF-IDF), which gives more weight to terms that are unique to a document but less frequent across the entire corpus.
- Define data features, e.g.
  - Document Length: Public sector documents might be longer, more formal.
  - Presence of Certain Keywords: Use specific keywords that correlate with either the public or private sector.
  - Sector Tags: In documents where the “sector tag” is present, you can use it as a feature for training.
- Predicting Missing Sector Tags (Classification):
  - Use models like:
    - Logistic Regression: for a binary classification (e.g., public vs. private).
    - Random Forest or XGBoost: if you have a more complex tagging scheme (e.g., multiple sector categories).
  - Cross-validation: Ensure the model generalizes well by validating with the documents that already have the sector tag filled in.
- Evaluate the model: Use metrics like accuracy, precision, recall, and F1 score to evaluate the model’s performance.
— I could check whether the result corresponds to the sector flags in the project metadata
(the flags have more missing values, but they are more objective!)
Topic modeling algorithms with Latent Dirichlet Allocation (LDA)
Topic modeling algorithms like Latent Dirichlet Allocation (LDA) can be applied to automatically uncover underlying themes within a corpus. The detected topics may highlight key terms or subject areas that are strongly associated with either the public or private sector.
Named Entity Recognition using CleanNLP and spaCy
NER is especially useful for analyzing unstructured text.
NER can identify key entities (organizations, people, locations) mentioned in the text. By tracking which entities appear frequently (e.g., government agencies vs. corporations), it’s possible to categorize a document as more focused on the public or private sector.
— Summarise the tokens by parts of speech
# Initialize the spacy backend
cnlp_init_spacy()
quarto render analysis/01b_WB_project_pdo_anal.qmd --to html
open ./docs/analysis/01b_WB_project_pdo_anal.html